Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is
readable
maintainable
reproducible
Core packages: ggplot2, dplyr, tidyr, readr, broom
Our Dataset
Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.
Key Features: Health indicators related to diabetes, including:
:::
What are the key predictive variables in diabetes prognosis?
How does gender influence the manifestation and progression of diabetes?
Removed Missing Values: df_cleaned <- df |> drop_na()
Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))
Filtered Incorrect Values: Filtered out rows with values outside expected ranges.
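The three cleaning steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual script: the toy data frame, the choice of columns, and the range thresholds (for example BMI between 10 and 100) are assumptions made for the example.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the BRFSS data; the values are invented for illustration.
df <- data.frame(
  Diabetes_binary = c(0, 1, 0, 1, NA),
  BMI             = c(22, 31, 150, 28, 26),
  Smoker          = c(0, 1, 1, 0, 1)
)

# 1. Remove rows that contain any missing value.
df_cleaned <- df |> drop_na()

# 2. Record the class of every column in a one-row summary.
column_types <- summarise(df_cleaned, across(everything(), class))

# 3. Keep only rows inside the expected ranges (thresholds are assumed).
df_cleaned <- df_cleaned |>
  filter(
    Diabetes_binary %in% c(0, 1),
    Smoker %in% c(0, 1),
    between(BMI, 10, 100)
  )
```

Running the type check before filtering means a column read in as character is noticed before any numeric comparison is applied to it.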
Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).
Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.
Socio-Economic Class: Derived from income, education, and healthcare status.
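A hedged sketch of how such derived variables might be built with `mutate()` and `case_when()`. The column names `PhysActivity`, `Income`, `Education`, and `AnyHealthcare`, the category labels, and every cut-off below are assumptions chosen for illustration; the rules actually used in the project are not stated here.

```r
library(dplyr)

# Toy input; column names and codings are assumed, values are invented.
df_cleaned <- data.frame(
  Smoker        = c(0, 1, 1, 0),
  PhysActivity  = c(1, 0, 1, 0),
  Income        = c(8, 2, 5, 1),
  Education     = c(6, 3, 4, 2),
  AnyHealthcare = c(1, 0, 1, 1)
)

df_augmented <- df_cleaned |>
  mutate(
    # Binary indicator recoded as a labelled factor.
    Smoking_Status = factor(Smoker, levels = c(0, 1),
                            labels = c("Non-smoker", "Smoker")),
    # Hypothetical lifestyle summary built from two indicators.
    Habits = case_when(
      Smoker == 0 & PhysActivity == 1 ~ "Healthy",
      Smoker == 1 & PhysActivity == 0 ~ "Unhealthy",
      TRUE                            ~ "Mixed"
    ),
    # Hypothetical socio-economic class; conditions are checked in order.
    SocioEconomic_Class = case_when(
      Income >= 6 & Education >= 5 & AnyHealthcare == 1 ~ "High",
      Income <= 2 | Education <= 2                      ~ "Low",
      TRUE                                              ~ "Middle"
    )
  )
```

Because `case_when()` returns the first matching branch, the order of the conditions is part of the definition of each class.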
Purpose: Check the relationships among the numerical variables.
Two types: Among all variables, and between each variable and the target variable (Diabetes_binary).
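Both views can be obtained from a single correlation matrix. The sketch below uses simulated data and base R's `cor()`, which defaults to Pearson correlation; the coefficient type, the simulated columns, and the object names are assumptions for the example.

```r
set.seed(1)
n <- 200

# Simulated numerical data standing in for the real columns.
df_num <- data.frame(
  BMI = rnorm(n, mean = 28, sd = 5),
  Age = sample(1:13, n, replace = TRUE)
)
df_num$Diabetes_binary <- rbinom(n, 1, plogis(0.15 * (df_num$BMI - 28)))

# Type 1: correlation between every pair of variables.
cor_all <- cor(df_num)

# Type 2: correlation of each predictor with the target only.
cor_target <- cor_all[, "Diabetes_binary"]
cor_target <- cor_target[names(cor_target) != "Diabetes_binary"]
cor_target <- sort(cor_target, decreasing = TRUE)
```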
All variables: Creation of a GLM with all numerical variables.
Stepwise selection: Forward and backward steps to select the best variables.
Evaluation: Select the model with the lowest AIC and analyse its selected variables.
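A minimal sketch of this workflow using `glm()` and `step()` from base R's stats package. The binomial family, the simulated predictors (including a deliberately uninformative `Noise` column), and the object names are assumptions made for illustration.

```r
set.seed(42)
n <- 500

# Simulated data: two informative predictors and one pure-noise column.
df_num <- data.frame(
  BMI     = rnorm(n, mean = 28, sd = 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  Noise   = rnorm(n)
)
df_num$Diabetes_binary <- rbinom(
  n, 1, plogis(-4 + 0.08 * df_num$BMI + 0.5 * df_num$GenHlth)
)

# Full logistic model with every numerical variable as a predictor.
full_model <- glm(Diabetes_binary ~ ., data = df_num, family = binomial)

# Stepwise search in both directions; a move is kept only if it lowers AIC.
step_model <- step(full_model, direction = "both", trace = 0)

# Compare AIC values and list the variables that survived selection.
aic_values    <- c(full = AIC(full_model), stepwise = AIC(step_model))
selected_vars <- attr(terms(step_model), "term.labels")
```

Since `step()` starts from the full model here and no wider `scope` is given, the search can drop terms and re-add previously dropped ones, but it cannot introduce variables outside the full model.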
Purpose: Reduce the number of variables, since the dataset contains many.
Logistic regression: Use of those components to perform a diabetes prediction model.
Evaluation: Confusion matrix and accuracy.
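The sketch below illustrates one way this could look, assuming the components are principal components obtained with `prcomp()`. The number of retained components (two), the 0.5 classification threshold, and the simulated data are all assumptions; accuracy is computed on the training data purely to keep the example short.

```r
set.seed(7)
n <- 400

# Simulated predictors and outcome standing in for the real data.
X <- data.frame(
  BMI      = rnorm(n, mean = 28, sd = 5),
  GenHlth  = sample(1:5, n, replace = TRUE),
  Age      = sample(1:13, n, replace = TRUE),
  PhysHlth = sample(0:30, n, replace = TRUE)
)
y <- rbinom(n, 1, plogis(-5 + 0.08 * X$BMI + 0.6 * X$GenHlth))

# PCA on centred, scaled predictors; keep the first two components.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pcs <- as.data.frame(pca$x[, 1:2])
pcs$Diabetes_binary <- y

# Logistic regression on the retained components.
pca_model <- glm(Diabetes_binary ~ PC1 + PC2, data = pcs, family = binomial)

# Classify with an assumed 0.5 probability threshold.
predicted <- as.integer(predict(pca_model, type = "response") > 0.5)

# Confusion matrix and accuracy (in-sample).
conf_matrix <- table(
  Predicted = factor(predicted, levels = c(0, 1)),
  Actual    = factor(y, levels = c(0, 1))
)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
```

Fixing the factor levels to `c(0, 1)` keeps the confusion matrix 2 x 2 even if the model predicts only one class.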
Purpose: Assess the influence of gender on the diagnosis of diabetes and address gender bias in science.
New datasets: Creation of two separate datasets, one per sex.
Evaluation: Compare the variables selected for each model and select the best model according to AIC.
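A sketch of the sex-stratified analysis under stated assumptions: the `Sex` column is taken to be coded 0 for female and 1 for male, the data are simulated, and the same stepwise logistic regression is fitted to each subset. None of the numbers produced here reflect the project's results.

```r
set.seed(3)
n <- 600

# Simulated data; Sex coding (0 = female, 1 = male) is an assumption.
df_num <- data.frame(
  Sex     = rbinom(n, 1, 0.5),
  BMI     = rnorm(n, mean = 28, sd = 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.4)
)
df_num$Diabetes_binary <- rbinom(
  n, 1, plogis(-5 + 0.08 * df_num$BMI + 0.5 * df_num$GenHlth + 0.7 * df_num$HighBP)
)

# One dataset per sex; Sex is dropped because it is constant within each.
df_female <- subset(df_num, Sex == 0, select = -Sex)
df_male   <- subset(df_num, Sex == 1, select = -Sex)

# Stepwise-selected logistic model for each subset.
model_female <- step(
  glm(Diabetes_binary ~ ., data = df_female, family = binomial),
  direction = "both", trace = 0
)
model_male <- step(
  glm(Diabetes_binary ~ ., data = df_male, family = binomial),
  direction = "both", trace = 0
)

# Variables retained in each model, and the AIC of each final model.
selected <- list(
  female = attr(terms(model_female), "term.labels"),
  male   = attr(terms(model_male), "term.labels")
)
aic <- c(female = AIC(model_female), male = AIC(model_male))
```

AIC is only comparable between models fitted to the same data, so it ranks candidate models within each subset, while the comparison across sexes rests on which variables each model retains.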
GenHlth, HighBP, and BMI emerged as significant predictors.